docs: v0.3.2 overhaul — concepts + lifecycle + progressive disclosure case study #203
Merged
…sandbox hardening
Wholesale restructure now that v0.3.2 is shipped. Goals:
- audience routing from README so eval researchers / task authors / agent
builders / Harbor migrators land on the right page
- pin the mental model in one place (Trial / Scene / Role / Verifier +
full lifecycle diagram) instead of scattering it across api-reference
and use-cases
- promote the SWE-bench Pro / Josh @ GitHub progressive-disclosure case
study to a first-class section in progressive-disclosure.md, with
Harbor #1316 parity comparison and the soft-verify vs full-verify split
- give sandbox hardening its own page and route to labs/ for the
empirical validation
## Structure
README.md — audience routing, doc map, featured + research
CLAUDE.md — minimal: setup, conventions, release shape
docs/getting-started.md — was quickstart.md, slim, links forward
docs/concepts.md — NEW: primitives, lifecycle, multi-turn vs round vs scene
docs/progressive-disclosure.md — REWRITE: Josh case study, lifecycle integration,
soft- vs full-verify, Harbor #1316 parity table
docs/sandbox-hardening.md — NEW: threat model, hardening sequence, labs index
docs/task-authoring.md — unchanged (already solid)
docs/use-cases.md — unchanged (multi-agent patterns)
docs/skill-eval.md — was skill-eval-guide.md (rename only)
docs/examples/ — was docs/notebooks/ (rename only)
docs/reference/cli.md — was docs/cli-reference.md
docs/reference/python-api.md — was docs/api-reference.md
labs/ stays at repo root — runnable research code with relative imports;
referenced from docs/sandbox-hardening.md and README "Research artifacts."
## What's new content-wise
- docs/concepts.md (new): the five primitives (Task / Agent / Environment /
Verifier / Trial), trial lifecycle ASCII diagram, Scenes/Roles/Turns,
User abstraction summary, multi-turn vs multi-round vs multi-scene table.
- docs/progressive-disclosure.md (rewrite): full SWE-bench Pro case study
with the 2026-04-24 Daytona validation table, lifecycle integration
diagram showing where _run_user_loop plugs in, soft-verify vs full-verify
comparison, Harbor #1316 parity discussion, expanded API reference.
- docs/sandbox-hardening.md (new): the BenchJack/Meerkat threat context,
the 10-step hardening sequence, the per-task [verifier.hardening]
opt-out semantics, and labs/ as empirical validation.
- README.md: PyPI badge 0.3.0a3 → 0.3.2, audience routing table,
documentation index, featured progressive-disclosure callout, research
artifacts section linking labs.
- CLAUDE.md: trimmed to the essential conventions (test discipline,
human review, trunk-based, release shape).
Cross-references updated throughout (no broken links to old paths).
…re data
- Removed all "Josh @ GitHub/Microsoft" / "Josh's" references from docs,
README, example script, and notebook. Reframed as "the SWE-bench Pro
progressive-disclosure use case" with Harbor #1316 as the cited PR.
- Ran progressive disclosure (3 rounds, Gemini 3.1 Pro Preview, Daytona)
on all 5 oracle-passing SWE-bench Pro tasks. Results aggregated to
experiments/swebench-pro-progressive-results.json and rendered into
the notebook + docs:
  - ansible — error: stdout closed at 17min
  - flipt — 0.0 (195 tools, 3 rounds)
  - openlibrary — 1.0 (82 tools, 3 rounds — soft 0.0 each, final 1.0)
  - navidrome — 0.0 (145 tools, 3 rounds)
  - qutebrowser — error: agent timeout at 50min
Honest take in the docs: infrastructure works, two infra failures
unrelated to disclosure, no measurable lift on flipt with this model
on this run. Single-model run, not a paper comparison.
- Fixed AttributeError in swebench_pro_user_dogfood.py (RunResult has
no trial_dir attribute) — script was crashing post-trial.
…h docs

Fixes the two infra failure modes surfaced by the 5-task progressive disclosure run on Daytona:
1. ansible: 'Process closed stdout (rc=None)' after 17min with 0 tool calls
2. qutebrowser: 'Agent timed out after 3000s' with 0 tool calls

Both came from agents hanging (no output, no progress) while the local subprocess wrapper was still alive — the existing error messages didn't make the failure mode actionable.

## Changes
**src/benchflow/process.py**: when stdout returns EOF, distinguish 'local subprocess still alive but transport closed' (rc=None — Daytona idle sleep, SSH drop, agent hung) from 'local subprocess actually exited' (rc set). Surface the distinction in the error message.

**src/benchflow/_acp_run.py**: new `idle_timeout` parameter on `execute_prompts()` and a `_prompt_with_idle_watchdog()` helper. Polls `session.tool_calls` every few seconds and aborts the prompt if no new tool call arrives for `idle_timeout` seconds. Catches the qutebrowser-style hang where the agent connects, never produces a tool call, and chews through the full agent timeout (50min in our case).

**src/benchflow/trial.py**: new `TrialConfig.agent_idle_timeout` field (default 600s = 10min), wired through to `execute_prompts()`. Tasks / callers can override with `None` to disable, or with a higher number for tasks that legitimately spend long stretches in agent thinking.

**docs/getting-started.md**: OAuth / subscription auth section. Lists the three agents that pick up host CLI logins (claude-agent-acp, codex-acp, gemini) and the detect files for each. 'No API key needed if you ran `claude login`'.

## Validation
- ruff clean
- 88 tests pass (test_user, test_process, test_sandbox_hardening)
- Will re-run ansible + qutebrowser progressive disclosure to confirm the new idle timeout aborts them cleanly with a clear error message
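The `TrialConfig.agent_idle_timeout` semantics described above can be sketched as a minimal dataclass. This is illustrative only: the real `TrialConfig` carries many more fields, and only the `agent_idle_timeout` name and its default/override behavior come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialConfig:
    # Idle watchdog budget in seconds. Default 600s = 10min.
    # None disables the watchdog entirely; pass a larger value for
    # tasks that legitimately spend long stretches in agent thinking.
    agent_idle_timeout: Optional[float] = 600.0
```

Callers would then construct `TrialConfig(agent_idle_timeout=None)` to opt out, or leave the default for the 10-minute watchdog.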
…p-token)

Per Anthropic Claude Code authentication docs, the third auth path users have is generating a 1-year OAuth token with `claude setup-token` and setting CLAUDE_CODE_OAUTH_TOKEN. This is the right option for CI / headless / sandbox environments where browser login isn't available.

Reorganized the auth section into three numbered options:
1. Host CLI login (subscription_auth, file detection)
2. Long-lived CLAUDE_CODE_OAUTH_TOKEN env var (Claude only)
3. API key (works with every agent)

Plus a precedence note from Anthropic's auth docs. benchflow already auto-inherits CLAUDE_CODE_OAUTH_TOKEN per src/benchflow/_agent_env.py:63 — this is just docs catching up.
After the 'agent_idle_timeout + EOF diagnostics' fix, re-ran ansible and qutebrowser (the two that flaked on first attempt). Both completed 3 rounds and reached final reward 1.0.

Final 5-task results (Gemini 3.1 Pro, Daytona, 3 rounds each):

| Task | Final | Tools | soft-verify (per round) | Notes |
|---|---|---|---|---|
| ansible | 1.0 | 126 | 0.0 / 0.0 / 0.0 | passed on retry (1st: stdout EOF) |
| flipt | 0.0 | 195 | 0.0 / 0.0 / 0.0 | hard fail |
| openlibrary | 1.0 | 82 | 0.0 / 0.0 / 0.0 | baseline already passed |
| navidrome | 0.0 | 145 | 0.0 / 0.0 / 0.0 | hard fail |
| qutebrowser | 1.0 | 183 | 0.0 / 0.0 / 0.0 | passed on retry (1st: 50min timeout) |

3/5 final pass. flipt and navidrome stayed at 0.0 across all rounds — Gemini 3.1 Pro doesn't crack them with this hint schedule.

Updated:
- experiments/swebench-pro-progressive-results.json
- examples/swebench_pro_progressive_disclosure.ipynb (re-executed with new data)
- docs/progressive-disclosure.md validation table + commentary
Three findings, all real:
1. docs/reference/python-api.md:230 — relative link broken after the
docs/api-reference.md → docs/reference/python-api.md move. Fix: use
../examples/ instead of docs/examples/.
2. _acp_run.py _prompt_with_idle_watchdog — race condition: after
`await asyncio.sleep(poll_interval)`, prompt_task could have
completed during the sleep. Without re-checking `done()` before the
timeout evaluations, we'd cancel a completed task and silently
discard a successful result. Added a `done()` re-check that breaks
out of the loop.
3. trial.py:674 — the TimeoutError handler was overwriting the idle
watchdog's detailed message ("Agent idle for 600s with no new tool
call (last activity 642s ago, 0 tool calls so far)") with a generic
"Agent timed out after {self._timeout}s" using the wall-clock
budget, not the idle timeout. Preserve the watchdog's message when
present; fall back to the generic message only when the exception
has no detail.
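Finding 2's race can be shown in a minimal poll loop. This is a sketch, not the real `_prompt_with_idle_watchdog`: the function name, the `idle_exceeded` callback, and the loop shape are assumptions; only the fix itself (re-checking `done()` immediately after the sleep) comes from the finding.

```python
import asyncio

async def poll_until_done(prompt_task, poll_interval, idle_exceeded):
    """Illustrative watchdog loop. Without the done() re-check after
    the sleep, a prompt that completed mid-sleep would be cancelled
    and its successful result silently discarded."""
    while True:
        await asyncio.sleep(poll_interval)
        if prompt_task.done():  # the re-check added by the fix
            break
        if idle_exceeded():
            prompt_task.cancel()
            raise TimeoutError("Agent idle past idle_timeout")
    return prompt_task.result()
```

In the test below `idle_exceeded` always reports True, so only the `done()` re-check keeps the completed result from being thrown away.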
Devin caught: _prompt_with_idle_watchdog created prompt_task with asyncio.create_task() but only cancelled it on the explicit timeout branches. If the parent coroutine was cancelled externally (asyncio.timeout, task.cancel(), Ctrl+C), CancelledError propagated out of the sleep without cancelling prompt_task — leaking the agent prompt until Trial.cleanup() eventually killed the process, plus asyncio's "Task exception was never retrieved" warning.

Fix: wrap the polling loop in try/finally so cancel + drain always runs, including the implicit-cancellation path. Both timeout branches now just `raise TimeoutError(...)` without their own cancel/drain block — the finally handles it uniformly.
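The try/finally shape of that fix can be sketched as follows. Function and parameter names are hypothetical; what the sketch demonstrates is that cancel + drain sits in a single `finally`, so it runs on normal return, on the watchdog's TimeoutError, and on external CancelledError alike.

```python
import asyncio
import contextlib

async def run_with_watchdog(coro, poll_interval, idle_exceeded):
    """Illustrative version of the fix: one cleanup path for all exits."""
    prompt_task = asyncio.create_task(coro)
    try:
        while not prompt_task.done():
            await asyncio.sleep(poll_interval)
            if idle_exceeded():
                # No per-branch cancel/drain: just raise; the finally
                # below handles cleanup uniformly.
                raise TimeoutError("Agent idle past idle_timeout")
        return prompt_task.result()
    finally:
        # Runs on return, TimeoutError, and external cancellation, so
        # the agent prompt is never leaked and asyncio never warns
        # "Task exception was never retrieved".
        prompt_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await prompt_task
```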
Devin caught: the execute_prompts docstring says idle_timeout fires when "no tool call OR message arrives", but _prompt_with_idle_watchdog was only polling session.tool_calls. ACPSession also accumulates message_chunks and thought_chunks via handle_update — agents actively streaming text without producing a new tool call would be falsely aborted.

Fix: use a single _activity_count() that sums tool_calls + message_chunks + thought_chunks so any of the three resets the idle timer. Updated the TimeoutError message to mention all three categories.
Single source of truth for runnable examples + teaching notebooks.
Previous split (docs/examples/ for teaching, examples/ for scripts)
was arbitrary — both directories held the same kinds of files (.py
demo scripts and .ipynb notebooks), and the duplication confused
readers about where to look.
Now: examples/ at repo root holds everything; docs/ has no examples
subdir. labs/ stays separate at repo root for research artifacts
(separate purpose: validation + reproducible experiments).
Files moved (git mv):
docs/examples/coder-reviewer-demo.py → examples/
docs/examples/scene-patterns.{ipynb,md,py} → examples/
docs/examples/nanofirm-task/ → examples/
References updated:
README.md — single "examples/" link
examples/scene-patterns.md — `python docs/examples/...` → `python examples/...`
examples/coder-reviewer-demo.py — same
examples/scene-patterns.py — same
Devin caught: poll_interval was computed solely from idle_timeout (min(30, max(5, idle_timeout // 4))). With the default idle_timeout=600, poll_interval was always 30s. The wall-clock deadline was only checked after each `await asyncio.sleep(poll_interval)`, so a task with timeout_sec=60 could overshoot to 90s (50%); timeout_sec=30 could overshoot to 60s (100%). Pre-PR, execute_prompts used asyncio.wait_for() which enforced the wall-clock timeout precisely. Adding the idle watchdog as the new default path silently regressed timeout precision.

Fix: factor timeout into the poll interval too — `min(30, idle_timeout // 4, max(1, timeout // 4))` floored at 1. Short total budgets now get proportionally shorter poll intervals.
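The fixed formula can be written out directly (the function name is illustrative; the expression is the one quoted above, with the stated floor of 1 second applied):

```python
def compute_poll_interval(idle_timeout: int, timeout: int) -> int:
    """Poll interval per the fix: min(30, idle_timeout // 4,
    max(1, timeout // 4)), floored at 1 second, so short wall-clock
    budgets are checked proportionally more often."""
    return max(1, min(30, idle_timeout // 4, max(1, timeout // 4)))
```

With the defaults (idle_timeout=600) a 3000s budget still polls every 30s, but a 60s budget now polls every 15s, capping the worst-case overshoot at one poll interval instead of 50-100% of the budget.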
## Why
Now that v0.3.2 is shipped (BaseUser, hardening opt-outs, DinD compose, lint cleanup), the docs need to catch up. They were sprawling, version-stale (README badge said v0.3.0a3), missing a mental-model page, and didn't surface the SWE-bench Pro / Josh @ GitHub case study or the labs/ research artifacts.